-
Notifications
You must be signed in to change notification settings - Fork 428
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PARQUET-2474: Add FIXED_SIZE_LIST logical type #241
base: master
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting way to get lists without repetition.
LogicalTypes.md
Outdated
### FIXED_SIZE_LIST | ||
|
||
The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements | ||
of a primitive data type. It must annotate a `binary` primitive type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"binary" means either fixed or variable length, right? I always get confused 😅.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please provide a concrete example on how the list is structured? What about their definition & repetition levels? Intuitively, I thought not limit it to binary type. For example, it would be possible to support something like int[N] or double[N] and even multi-dimensional list like int[M][N].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps use byte_array
in this PR (see #251).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do, thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please provide a concrete example on how the list is structured? What about their definition & repetition levels? Intuitively, I thought not limit it to binary type. For example, it would be possible to support something like int[N] or double[N] and even multi-dimensional list like int[M][N].
I would represent the fixed sized list as a non-nested FIXED_LEN_BYTE_ARRAY
+ type
+ num_values
. Multidimensional lists/arrays bring much more complexity that I'm not sure makes sense to store as a logical type (see FixedShapeTensor in Arrow). Also see #241 (comment).
Perhaps use
byte_array
in this PR (see #251).
Done.
One thing to perhaps give thought to is how this might represent nested lists, say you wanted to encode a m by n matrix, would you just encode this as a I had perhaps been anticipating that fixed size list would be a variant of "REPEATED" as opposed to a physical type, that is just able to avoid incrementing the max_def_level and max_rep_level. This would make it significantly more flexible I think, although I concede it will make it harder to implement. |
cc @JFinis |
src/main/thrift/parquet.thrift
Outdated
struct EnumType {} // allowed for BINARY, must be encoded with UTF-8 | ||
struct DateType {} // allowed for INT32 | ||
struct Float16Type {} // allowed for FIXED[2], must encoded raw FLOAT16 bytes | ||
struct FixedSizeListType {} // see LogicalTypes.md |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Something is missing here. Shouldn't this type contain the element type? And the length of the list? The length of the list could be deduced from the size of the underlying fixed_len_byte_array
, but at least the element type would be necessary then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to:
struct FixedSizeListType { // allowed for FIXED_LEN_BYTE_ARRAY[num_values * width of type],
1: required Type type; // see LogicalTypes.md
2: required i32 num_values;
}
struct VariableSizeListType { // allowed for BYTE_ARRAY, see LogicalTypes.md
1: required Type type;
}
@@ -255,6 +255,16 @@ The primitive type is a 2-byte fixed length binary. | |||
|
|||
The sort order for `FLOAT16` is signed (with special handling of NANs and signed zeros); it uses the same [logic](https://github.com/apache/parquet-format#sort-order) as `FLOAT` and `DOUBLE`. | |||
|
|||
### FIXED_SIZE_LIST |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting choice to annotate a binary primitive field instead of a repeated group field. I see pros and cons with this design:
PROs:
- Guarantees zero-copy, as the layout is defined to be just bytes. In contrast, would this annotate a group, a writer could decide to use a fancy per-value encoding (e.g., dictionary) and thus create a list that first has to be "decoded" before it can be used.
- Guarantees that a list is always contained on one page instead of being split over multiple pages. Again, this helps in keeping decoders easy and guaranteeing zero copy.
- This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.
CONs:
- Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.
- Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferrable.
- Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.
I think the PROs outweigh the CONs here, so I think this is fine with me. I just want everyone to be aware about the ramifications.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @tustvold, as you also brought up this point. I agree that having a new property of a repeated group would be more flexible, but it also comes at some cost, as outlined above. Also, it couldn't be just a logical type in this case, as a logical type cannot change the handling of R-Levels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm now feeling that maybe wrapping a Vector[PrimitiveType, Size]
is also ok, but currently representing this is a bitweird in the model. May I ask would a Vector
having data below?
1. [1, 1, 1], [null, 1, 1] <-- data with null
2. null, [1, 1, 1] <-- null vector
And would vector contains a "nested" vector?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- This solves the problem of redundant R-Levels. Since it's just a primitive column, no r-level considerations have to be taken into account.
This is the main reason I'd like to propose this type, see apache/arrow#34510.
- Cannot create fixed size lists of nested types (e.g., list of structs). I see that this isn't necessary for tensors or embedding vectors, but shouldn't the feature be extensible for other scenarios as well? This limits the composability of the feature. I can now create a struct of fixed size lists, but not a fixed size list of structs.
Lack of composability is a downside, but I think it's still worth the compromise. I've not seen need for fixed_size_list(struct) in tensor computing, but that's probably just because it's not available.
- Cannot have null elements in fixed size lists. This might not be desired for all lists, but there can be use cases where having null values in them is preferrable.
In tensor computation this is usually addressed with bitmasks, which can be stored as a fixed_size_list(binary, num_values)
.
- Parquet has a concept for (non-fixed size) lists. It is conceptually weird that fixed size lists are totally different from (non-fixed size) lists.
Perhaps we should call this type FixedSizeArray
to disambiguate?
I'm now feeling that maybe wrapping a
Vector[PrimitiveType, Size]
is also ok, but currently representing this is a bitweird in the model. May I ask would aVector
having data below?1. [1, 1, 1], [null, 1, 1] <-- data with null 2. null, [1, 1, 1] <-- null vector
And would vector contains a "nested" vector?
I think case 2. is ok, but case 1. should be expressed with a separate null bitmask that's not part of the type.
Apologies for taking a while to reply. I've split this into two cases:
We could start with a more general multidimensional array definition and have list be a 1 dimensional array. Additional metadata required would not be that bad. I'm just a bit scared of validation and striding logic bleeding into parquet implementations. Do we have any other inputs / opinions?
That's interesting. What would you expect performance wise with this approach? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking good to me. Just a few questions/comments. Thanks!
LogicalTypes.md
Outdated
The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of | ||
elements of the same primitive data type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the encoding be defined as well, for instance the elements of the array are encoded in the same manner as PLAIN
encoding?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that seems like a thing to specify. Changed to:
The `FIXED_LEN_BYTE_ARRAY` data is interpreted as a fixed size sequence of
elements of the same primitive data type encoded with plain encoding.
LogicalTypes.md
Outdated
### FIXED_SIZE_LIST | ||
|
||
The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements | ||
of a primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As written, the elements can themselves be arrays. Is this intended? Or should it be "non-array primitive data type"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't really consider the possibility of elements being arrays and I think non-array limitation makes sense. Changed to:
The `FIXED_SIZE_LIST` annotation represents a fixed-size list of elements
of a non-array primitive data type. It must annotate a `FIXED_LEN_BYTE_ARRAY` primitive type.
Thanks for the review @etseidl ! I've updated this with your suggestions. |
@ritchie46 would this be useful for your new polars Array type? |
As proposed in apache/arrow#34510 and on ML, PARQUET-2474.
Arrow recently introduced FixedShapeTensor and VariableShapeTensor canonical extension types that use FixedSizeList and StructArray(List, FixedSizeList) as storage respectfully. These are targeted at machine learning and scientific applications that deal with large datasets and would benefit from using Parquet as on disk storage.
However currently FixedSizeList is stored as List in Parquet which adds significant conversion overhead when reading and writing as discussed here. It would therefore be beneficial to introduce a FIXED_SIZE_LIST logical type to Parquet.